nullnull 1 2
Frequentist Regret Analysis of Gaussian Process Thompson Sampling via Fractional Posteriors
Roy, Somjit, Jaiswal, Prateek, Bhattacharya, Anirban, Pati, Debdeep, Mallick, Bani K.
We study Gaussian Process Thompson Sampling (GP-TS) for sequential decision-making over compact, continuous action spaces and provide a frequentist regret analysis based on fractional Gaussian process posteriors, without relying on domain discretization as in prior work. We show that the variance inflation commonly assumed in existing analyses of GP-TS can be interpreted as Thompson Sampling with respect to a fractional posterior with tempering parameter $α\in (0,1)$. We derive a kernel-agnostic regret bound expressed in terms of the information gain parameter $γ_t$ and the posterior contraction rate $ε_t$, and identify conditions on the Gaussian process prior under which $ε_t$ can be controlled. As special cases of our general bound, we recover regret of order $\tilde{\mathcal{O}}(T^{\frac{1}{2}})$ for the squared exponential kernel, $\tilde{\mathcal{O}}(T^{\frac{2ν+3d}{2(2ν+d)}} )$ for the Matérn-$ν$ kernel, and a bound of order $\tilde{\mathcal{O}}(T^{\frac{2ν+3d}{2(2ν+d)}})$ for the rational quadratic kernel. Overall, our analysis provides a unified and discretization-free regret framework for GP-TS that applies broadly across kernel classes.
Kernel ridge regression under power-law data: spectrum and generalization
Wortsman, Arie, Loureiro, Bruno
In this work, we investigate high-dimensional kernel ridge regression (KRR) on i.i.d. Gaussian data with anisotropic power-law covariance. This setting differs fundamentally from the classical source & capacity conditions for KRR, where power-law assumptions are typically imposed on the kernel eigen-spectrum itself. Our contributions are twofold. First, we derive an explicit characterization of the kernel spectrum for polynomial inner-product kernels, giving a precise description of how the kernel eigen-spectrum inherits the data decay. Second, we provide an asymptotic analysis of the excess risk in the high-dimensional regime for a particular kernel with this spectral behavior, showing that the sample complexity is governed by the effective dimension of the data rather than the ambient dimension. These results establish a fundamental advantage of learning with power-law anisotropic data over isotropic data. To our knowledge, this is the first rigorous treatment of non-linear KRR under power-law data.
Some Robustness Properties of Label Cleaning
We demonstrate that learning procedures that rely on aggregated labels, e.g., label information distilled from noisy responses, enjoy robustness properties impossible without data cleaning. This robustness appears in several ways. In the context of risk consistency -- when one takes the standard approach in machine learning of minimizing a surrogate (typically convex) loss in place of a desired task loss (such as the zero-one mis-classification error) -- procedures using label aggregation obtain stronger consistency guarantees than those even possible using raw labels. And while classical statistical scenarios of fitting perfectly-specified models suggest that incorporating all possible information -- modeling uncertainty in labels -- is statistically efficient, consistency fails for ``standard'' approaches as soon as a loss to be minimized is even slightly mis-specified. Yet procedures leveraging aggregated information still converge to optimal classifiers, highlighting how incorporating a fuller view of the data analysis pipeline, from collection to model-fitting to prediction time, can yield a more robust methodology by refining noisy signals.
Supplementary Material A Proof of Paper Results
We now consider the gradient of the log-variance loss. Using the definition from Eq. 5, we see that From [Reiss, 2012, Lemma A.3.5] we have the bound Combining these estimates we arrive at the claimed result. The claim follows by direct calculation. In fact, it is possible to take K = 15 . Lemma 3 shows that the kurtosis term in our bound Eq. 16 can be bounded for Gaussian families.
Dimension-free bounds in high-dimensional linear regression via error-in-operator approach
Noskov, Fedor, Puchkin, Nikita, Spokoiny, Vladimir
In contrast to the standard intuition that a learner should search for a trade-off between approximation and estimation errors, researchers empirically observed that large interpolating rules may still have small test error. Moreover, when the number of parameters exceeds sample size the prediction risk of neural networks passes a U-shaped curve and decreases again [Zhang et al., 2017, Nakkiran et al., 2020]. A bit later it became clear that benign overfitting and double descent are not distinctive features of deep learning. Similar phenomena are ubiquitous for overparametrized models such as random forests and random feature models [Belkin et al., 2019, Mei and Montanari, 2022], kernel methods [Belkin et al., 2018, Liang and Rakhlin, 2020], and linear regression [Bartlett et al., 2020, Hastie et al., 2022] to name a few. In [Belkin et al., 2018], the authors reasonably suggested that we must study more tractable "shallow" methods better before diving into deep learning theory. In the present paper, we consider a classical linear regression problem, where a learner aims to estimate an unknown vector θ R d from i.i.d.
Estimates on Learning Rates for Multi-Penalty Distribution Regression
This paper is concerned with functional learning by utilizing two-stage sampled distribution regression. We study a multi-penalty regularization algorithm for distribution regression under the framework of learning theory. The algorithm aims at regressing to real valued outputs from probability measures. The theoretical analysis on distribution regression is far from maturity and quite challenging, since only second stage samples are observable in practical setting. In the algorithm, to transform information from samples, we embed the distributions to a reproducing kernel Hilbert space $\mathcal{H}_K$ associated with Mercer kernel $K$ via mean embedding technique. The main contribution of the paper is to present a novel multi-penalty regularization algorithm to capture more features of distribution regression and derive optimal learning rates for the algorithm. The work also derives learning rates for distribution regression in the nonstandard setting $f_{\rho}\notin\mathcal{H}_K$, which is not explored in existing literature. Moreover, we propose a distribution regression-based distributed learning algorithm to face large-scale data or information challenge. The optimal learning rates are derived for the distributed learning algorithm. By providing new algorithms and showing their learning rates, we improve the existing work in different aspects in the literature.